Abstract

We present a generalization of conventional artificial 
neural networks that allows for a functional equivalence 
to multi-expert systems. The new model provides an ar- 
chitectural freedom going beyond existing multi-expert 
models and an integrative formalism to compare and 
combine various techniques of learning. (We consider 
gradient, EM, reinforcement, and unsupervised learn- 
ing.) Its uniform representation aims at a simple ge- 
netic encoding and evolutionary structure optimization 
of multi-expert systems. This paper contains a detailed 
description of the model and learning rules, empirically 
validates its functionality, and discusses future perspec- 
tives. 



I Introduction 

When using multi-expert architectures for modeling 
behavior or data, the motivation is the separation of 
the stimulus or data space into disjoint regimes on
which separate models (experts) are applied (Jacobs
1999; Jacobs, Jordan, & Barto 1990). The idea is that 
experts responsible for only a limited regime can be 
smaller and more efficient, and that knowledge from 
one regime should not be extrapolated onto another 
regime, i.e., optimization on one regime should not 
interfere with optimization on another. Several ar- 
guments indicate that this kind of adaptability can- 
not be realized by a single conventional neural net- 
work (Toussaint 2002). Roughly speaking, for con- 
ventional neural networks the optimization of a re- 
sponse in one regime always interferes with responses 
in other regimes because they depend on the same 
parameters (weights), which are not separated into 
disjoint experts. 

To realize a separation of the stimulus space one
could rely on the conventional way of implementing 
multi-experts, i.e., allow neural networks for the im- 
plementation of expert modules and use external, of- 
ten more abstract types of gating networks to orga- 
nize the interaction between these modules. Much 
research is done in this direction (Bengio & Frasconi 
1994; Cacciatore & Nowlan 1994; Jordan & Jacobs 
1994; Rahman & Fairhurst 1999; Ronco, Gollee, & 
Gawthrop 1997). The alternative we want to propose 
here is to introduce a neural model that is capable of
representing systems that are functionally equivalent to



multi-expert systems within a single integrative net- 
work. This network does not explicitly distinguish 
between expert and gating modules and generalizes 
conventional neural networks by introducing a coun- 
terpart for gating interactions. What is our moti- 
vation for such a new representation of multi-expert 
systems? 

• First, our representation allows much more and 
qualitatively new architectural freedom. E.g., 
gating neurons may interact with expert neu- 
rons; gating neurons can be a part of experts. 
There is no restriction with respect to serial, 
parallel, or hierarchical architectures, in a much
more general sense than proposed in (Jordan &
Jacobs 1994).

• Second, our representation allows in an intu- 
itive way to combine techniques from various 
learning theories. This includes gradient de- 
scent, unsupervised learning methods like Hebb 
learning or the Oja rule, and an EM-algorithm 
that can be transferred from classical gating- 
learning theories (Jordan & Jacobs 1994). Fur- 
ther, the interpretation of a specific gating as 
an action exploits the realm of reinforcement 
learning, in particular Q-learning and (though 
not discussed here) its TD and TD(λ) variants
(Sutton & Barto 1998). 

• Third, our representation makes a simple ge- 
netic encoding of such architectures possible. 
There already exist various techniques for evo- 
lutionary structure optimization of networks (see 
(Yao 1999) for a review). Applied to our
representation, they become techniques for the
evolution of multi-expert architectures.

After the rather straight-forward generalization of 
neural interactions necessary to realize gatings (sec- 
tion II), we will discuss in detail different learning 
methods in section III. The empirical study in sec- 
tion IV compares the different interactions and learn- 
ing mechanisms on a test problem similar to the one 
discussed by Jacobs et al. (1990). 
II Model definition

Conventional multi-expert systems. Assume the 
system has to realize a mapping from an input space 
X to an output space Y. Typically, an m-expert
architecture consists of a gating function
g : X → [0, 1]^m and m expert functions f_i : X → Y,
which are combined by the softmax linear combination

$$ y = \sum_{i=1}^{m} g_i \, f_i(x) , \qquad g_i = \frac{e^{\beta g_i(x)}}{\sum_{j=1}^{m} e^{\beta g_j(x)}} , \tag{1} $$

where x and y are input and output, and β describes
the "softness" of this winner-takes-all type
competition between the experts, see Figure 1. The crucial
question becomes how to train the gating. We will 
discuss different methods in the next section. 
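
As an illustration of Eq. (1), the following minimal Python sketch combines scalar expert outputs with softmax gating weights. The function names `softmax_gate` and `mixture_output`, the restriction to scalar outputs, and the example values are assumptions made for this illustration only; they are not part of the model description.

```python
import math

def softmax_gate(scores, beta=1.0):
    """Softmax gating weights for raw gating scores; beta sets the 'softness'."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mixture_output(scores, expert_outputs, beta=1.0):
    """Gated linear combination y = sum_i g_i * f_i(x) for scalar expert outputs."""
    gates = softmax_gate(scores, beta)
    return sum(g * f for g, f in zip(gates, expert_outputs))

# Hypothetical example: two experts, the first one clearly preferred.
gates = softmax_gate([2.0, 0.0], beta=5.0)
y = mixture_output([2.0, 0.0], [1.0, -1.0], beta=5.0)
```

With a large β the combination approaches a winner-takes-all choice of a single expert; with β close to zero all experts contribute almost equally.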

Neural implementation of multi-experts. We 

present a single neural system that has at least the 
capabilities of a multi-expert architecture of several 
neural networks. Basically, we provide additional
competitive and gating interactions; for an illustration,
compare Figure 1 and Figure 2-B. More formally, we
introduce the model as follows: 

The architecture is given by a directed, labeled 
graph of neurons (i) and links (ij) from (j) to (i), 
where i, j = 1..n. Labels of links declare if they are
ordinary, competitive or gating connections. Labels
of neurons declare their type of activation function.
With every neuron (i), an activation state (output
value) z_i ∈ [0, 1] is associated. A neuron (i) collects
two terms of excitation x_i and g_i given by

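$$ x_i = \sum_{(ij)} w_{ij} \, z_j + w_i , \tag{2} $$

$$ g_i = \begin{cases} 1 & \text{if } N_i = 0 \\ \frac{1}{N_i} \sum_{(ij)^g} z_j & \text{else} \end{cases} , \qquad N_i = \sum_{(ij)^g} 1 , \tag{3} $$

where w_ij, w_i ∈ ℝ are weights and bias associated
with the links (ij) and the neuron (i), respectively.
The second excitatory term g_i has the meaning of
a gating term and is induced by N_i g-labeled links
(ij)^g.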

In case there are no c-labeled links (ij)^c connected
to a neuron (i), its state is given by

$$ z_i = \phi(x_i) \, g_i . \tag{4} $$

Here, φ : ℝ → [0, 1] is a sigmoid function. This
means, if a neuron (i) has no gating links (ij)^g
connected to it, then g_i = 1 and the sigmoid φ(x_i)
describes its activation. Otherwise, the gating term g_i
multiplies it.
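
The state of a neuron outside any competitive group can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logistic function stands in for the sigmoid φ, the gating term is taken to be the mean of the states arriving over gating links (and 1 if there are none), and the name `neuron_state` and its argument layout are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function, used here as one possible choice of the sigmoid phi."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_state(inputs, weights, bias, gating_inputs=()):
    """State z = phi(x) * g of a neuron that is not in a competitive group."""
    # Ordinary excitation: weighted sum of input states plus bias.
    x = sum(w * z for w, z in zip(weights, inputs)) + bias
    # Assumption: g is the mean of the gating states; g = 1 without gating links.
    g = sum(gating_inputs) / len(gating_inputs) if gating_inputs else 1.0
    return sigmoid(x) * g
```

A gating state of 0 silences the neuron regardless of its excitation, whereas a gating state of 1 leaves the ordinary sigmoid response unchanged.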
Figure 1: Ordinary multi-expert architecture. Gating
and expert modules are explicitly separated and
the gating may not depend on internal states or the
output of experts.

Neurons (i) that are connected by (bi-directed)
c-labeled links (ij)^c form a competitive group in which
only one of the neurons (the winner) acquires state
z_winner = 1 while the others' states are zero. Let {i}^c
denote the competitive group of neurons to which (i)
belongs. On such a group, we introduce a normalized
distribution y_i, Σ_{j∈{i}^c} y_j = 1, given by

$$ y_i = \frac{\psi(x_i)}{X_i} , \qquad X_i = \sum_{k \in \{i\}^c} \psi(x_k) . \tag{5} $$

Here, ψ is some function ψ(x) (e.g., the exponential).
The neurons' states z_j ∈ {0, 1}, j ∈ {i}^c, depend on
this distribution y_i by one of the following
competitive rules of winner selection: We will
consider a selection with probability proportional to y_i
(softmax), deterministic selection of the maximum y_i,
and ε-greedy selection (where with probability ε a
random winner is selected instead of the maximum).
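
The three winner-selection rules can be sketched as follows. The sketch assumes the exponential as competition function ψ by default; the function names, the rule labels passed as strings, and the optional `rng` argument are hypothetical choices made for this illustration.

```python
import math
import random

def distribution(excitations, psi=math.exp):
    """Normalized distribution y_i = psi(x_i) / sum_k psi(x_k) over one group."""
    values = [psi(x) for x in excitations]
    total = sum(values)
    return [v / total for v in values]

def select_winner(y, rule="softmax", epsilon=0.1, rng=random):
    """Return the index of the winning neuron under the given selection rule."""
    if rule == "softmax":
        # Selection with probability proportional to y_i.
        return rng.choices(range(len(y)), weights=y, k=1)[0]
    if rule not in ("max", "epsilon-greedy"):
        raise ValueError("unknown selection rule: " + rule)
    if rule == "epsilon-greedy" and rng.random() < epsilon:
        return rng.randrange(len(y))  # random winner with probability epsilon
    return max(range(len(y)), key=lambda i: y[i])  # deterministic maximum

def states(winner, n):
    """Binary states of the group: 1 for the winner, 0 for all others."""
    return [1 if i == winner else 0 for i in range(n)]
```

For example, `states(select_winner(distribution([0.0, 1.0, 2.0]), rule="max"), 3)` marks the third neuron as the winner.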

Please see Figure 2 to get an impression of the
architectural possibilities this representation provides.
Example A realizes an ordinary feed-forward neural 
network, where the three output neurons form a com- 
petitive group. Thus, only one of the output neu- 
rons will return a value of 1, the others will return 
0. Example B realizes exactly the same multi-expert 
system as depicted in Figure 1. The two outputs 
of the central module form a competitive group and 
gate the output neurons of the left and right mod- 
ule respectively — the central module calculates the 
gating whereas the left and right modules are the ex- 
perts. Example C is an alternative way of designing 
multi-expert systems. Each expert module contains 
an additional output node which gates the rest of its 
outputs and competes with the gating nodes of the 
other experts. Thus, each expert itself estimates how
well it can handle the current stimulus (see the
Q-learning method described below). Finally, example
D is a true hierarchical architecture. The two ex- 
perts on the left compete to give an output, which 
is further processed and, again, has to compete with 
Figure 2: Sample architectures (the legend distinguishes gating links from competition links). Please see section II for a description of these architectures.



the larger expert to the right. In contrast, Jordan & 
Jacobs (1994) describe an architecture where the cal- 
culation of one single gating (corresponding to only 
one competitive level) is organized in a hierarchical 
manner. Here, several gatings on different levels can 
be combined in any successive, hierarchical way. 

III Learning

In this section we introduce four different learning 
methods, each of which is applicable independent of 
the specific architecture. We generally assume that 
the goal is to approximate training data given as pairs 
(x, t) of stimulus and target output value. 

The gradient method. To calculate the gradient,
we assume that selection in competitive groups is
performed with probability proportional to the
distribution y_i. We calculate an approximate gradient of the
conditional probability P(y|x) that this system
represents by replacing the actual state in Eq. (2) by
its expectation value y_j for neurons in competitive
groups (see also Neal 1990). For simplicity of
notation, we identify z_i ≡ y_i. Then, for a neuron (i) in a
competitive group obeying Eq. (5), we get the partial
derivatives of the neuron's output with respect to its
excitations:

$$ \frac{\partial z_i}{\partial x_j} = \frac{\psi'(x_i)}{X_i} \, \delta_{ij} - \frac{\psi(x_i) \, \psi'(x_j)}{(X_i)^2} \, \delta_{j \in \{i\}^c} , \tag{6} $$

$$ \frac{\partial z_i}{\partial g_j} = 0 , \tag{7} $$



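where δ_{j∈{i}^c} = 1 iff j is a member of {i}^c. Let
E = E(z_1, .., z_n) be an error functional. We write
the delta-rule for back-propagation by using the
notations δ_i = dE/dz_i and (δ_i^x = dE/dx_i, δ_i^g = dE/dg_i)
for the gradients at a neuron's output and excitations,
respectively, and e_i = ∂E/∂z_i for the local error of a
single (output) neuron. From Eqs. (2,3,6,7) we get

$$ \delta_i = \frac{dE}{dz_i} = e_i + \sum_j \frac{dE}{dx_j} \frac{\partial x_j}{\partial z_i} + \sum_j \frac{dE}{dg_j} \frac{\partial g_j}{\partial z_i} = e_i + \sum_{(ji)} \delta_j^x \, w_{ji} + \sum_{(ji)^g} \frac{\delta_j^g}{N_j} , \tag{8} $$

$$ \delta_i^x = \frac{dE}{dx_i} = \sum_j \frac{dE}{dz_j} \frac{\partial z_j}{\partial x_i} = \frac{\psi'(x_i)}{X_i} \Big[ \delta_i - \sum_{j \in \{i\}^c} \delta_j \, z_j \Big] , \tag{9} $$

$$ \delta_i^g = \frac{dE}{dg_i} = \sum_j \frac{dE}{dz_j} \frac{\partial z_j}{\partial g_i} = 0 . \tag{10} $$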



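(In Eq. (9) we used X_i = X_j for i ∈ {j}^c and
i ∈ {j}^c ⇔ j ∈ {i}^c.) For neurons that do not join a
competitive group we get from Eq. (4)

$$ \frac{\partial z_i}{\partial x_j} = \phi'(x_i) \, g_i \, \delta_{ij} , \qquad \frac{\partial z_i}{\partial g_j} = \phi(x_i) \, \delta_{ij} , \tag{11} $$

$$ \delta_i^x = \frac{dE}{dx_i} = \sum_j \frac{dE}{dz_j} \frac{\partial z_j}{\partial x_i} = \phi'(x_i) \, g_i \, \delta_i , \tag{12} $$

$$ \delta_i^g = \frac{dE}{dg_i} = \sum_j \frac{dE}{dz_j} \frac{\partial z_j}{\partial g_i} = \phi(x_i) \, \delta_i , \tag{13} $$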



where δ_i is given in Eq. (8). The final gradients are

$$ \frac{\partial E}{\partial w_i} = \delta_i^x , \qquad \frac{\partial E}{\partial w_{ij}} = \delta_i^x \, z_j . \tag{14} $$



The choice of the error functional is free. E.g.,
it can be chosen as the square error E = Σ_i (z_i − t_i)^2,
e_i = 2(z_i − t_i), or as the log-likelihood
E = ln Π_i z_i^{t_i} (1 − z_i)^{1−t_i},
e_i = t_i/z_i − (1 − t_i)/(1 − z_i), where in the latter
case the targets are states t_i ∈ {0, 1}.
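
The derivative of the competitive distribution of Eq. (5) with respect to the excitations within one group can be checked numerically. The sketch below assumes the exponential for ψ (so that ψ' = ψ) and compares the analytic expression with a central finite difference; all function names are hypothetical.

```python
import math

def y_dist(x, psi=math.exp):
    """Distribution y_i = psi(x_i) / sum_k psi(x_k) for one competitive group."""
    vals = [psi(v) for v in x]
    total = sum(vals)
    return [v / total for v in vals]

def dy_dx(x, i, j, psi=math.exp, dpsi=math.exp):
    """Analytic derivative dy_i/dx_j for neurons i, j in the same group."""
    X = sum(psi(v) for v in x)
    kron = 1.0 if i == j else 0.0
    return dpsi(x[i]) * kron / X - psi(x[i]) * dpsi(x[j]) / X ** 2

def numeric_dy_dx(x, i, j, h=1e-6):
    """Central finite-difference estimate of dy_i/dx_j."""
    xp, xm = list(x), list(x)
    xp[j] += h
    xm[j] -= h
    return (y_dist(xp)[i] - y_dist(xm)[i]) / (2.0 * h)

# Hypothetical excitations of a group of three competing neurons.
x = [0.3, -0.2, 1.1]
max_gap = max(abs(dy_dx(x, i, j) - numeric_dy_dx(x, i, j))
              for i in range(3) for j in range(3))
```

A check of this kind is a cheap safeguard when the delta rules are implemented by hand, because sign and index errors in the competitive term are easy to make.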



The basis for further learning rules. For the 

following learning methods we concentrate on the ques- 
tion: What target values should we assume for the 
states of neurons in a competitive group ? In the case 
of gradient descent, Eq. (8) gives the answer. It actually
describes a linear projection of the desired output
variance down to all system states z_i, including those
in competitions. In fact, all the following learning
methods will adopt the above gradient descent rules
except for a redefinition of δ_i (or alternatively δ_i^x) in
the case of neurons (i) in competitive groups. This
means that neurons "below" competitive groups are 
adapted by ordinary gradient descent while the local 
error at competitive neurons is given by other rules 
than gradient descent. Actually this is the usual way 
for adapting systems where neural networks are used 
as internal modules and trained by back-propagation 
(e.g., see Anderson & Hong 1994). 

An EM-algorithm. We briefly review the basic 
ideas of applying an EM-algorithm on the problem of 
learning gatings in multi-experts (Jordan & Jacobs 
1994). The algorithm is based on an additional, very 
interesting assumption: Let the outcome of a
competition in a competitive group {c} be described by
the states z_i ∈ {0, 1}, Σ_{i∈{c}} z_i = 1, of the neurons
that join this group. Now, we assume that there
exists a correct outcome h_i ∈ {0, 1}, Σ_{i∈{c}} h_i = 1.
Formally, this means to assume that the complete
training data are triplets (x, h_i, t) of stimuli,
competition states, and output values.¹ However, the
competition training data is unobservable or hidden and
must be inferred by statistical means. Bayes' rule
gives an answer on how to infer an expectation of the
hidden training data h_i and lays the ground for an
EM-algorithm. The consequence of this assumption
is that now the y_i of competitive neurons are
supposed to approximate this expectation of the training
data h_i instead of being free. For simplification, let
us concentrate on a network containing a single com- 
petitive group; the generalization is straightforward. 

• Our system represents the conditional probability
of output states z^o and competition states
z^c, depending on the stimulus x and parameters
θ = (w_ij, w_i):

$$ P(z^o, z^c \,|\, x, \theta) = P(z^c \,|\, x, \theta) \; P(z^o \,|\, z^c, x, \theta) . \tag{15} $$

• (E-step) We use Bayes' rule to infer the expected
competition training data h_i hidden in a training
tuple (x, ·, t), i.e., the probability of h_i when
x and t are given:

$$ P(h_i \,|\, x, t) = \frac{P(t \,|\, h_i, x) \, P(h_i \,|\, x)}{P(t \,|\, x)} . \tag{16} $$

Since these probabilities refer to the training
(or teacher) system, we can only approximate
them. We do this by our current approximation,
i.e., our current system:

$$ P(h_i \,|\, x, t, \theta) = \frac{P(t \,|\, h_i, x, \theta) \, P(h_i \,|\, x, \theta)}{P(t \,|\, x, \theta)} = \frac{P(t \,|\, h_i, x, \theta) \, P(h_i \,|\, x, \theta)}{\sum_j P(t \,|\, h_j, x, \theta) \, P(h_j \,|\, x, \theta)} . \tag{17} $$

¹ More precisely, the assumption is that there exists a
teacher system of same architecture as our system. Our system
adapts free parameters w_ij, w_i in order to approximate this
teacher system. The teacher system produces training data
and, since it has the same architecture as ours, also uses
competitive groups to generate this data. The training data would
be complete if it included the outcomes of these competitions.



• (M-step) We can now adapt our system. In the
classical EM-algorithm, this amounts to maximizing
the expectation of the log-likelihood (cp.
Eq. (15))

$$ E[l(\theta')] = E[\ln P(h \,|\, x, \theta') + \ln P(t \,|\, z^c, x, \theta')] , \tag{18} $$

where the expectation is with respect to the
distribution P(h | x, t, θ) of h-values (i.e., depending
on our inference of the hidden states h); and the
maximization is with respect to parameters θ'.
This equation can be simplified further but,
very similar to the "least-square" algorithm
developed by Jordan & Jacobs (1994), we are
satisfied to have inferred an explicit desired
probability ŷ_i = P(h_i = 1 | x, t, θ) for the competition
states z_i that we use to define a mean-square error
and perform an ordinary gradient descent.

Based on this background we define the learning
rule as follows, with some subtle differences to the
one presented in (Jordan & Jacobs 1994). Equation
(17) defines the desired probability ŷ_i of the states
z_i. Since we assume a selection rule proportional to
the distribution y_i, the values ŷ_i are actually target
values for the distribution y_i. The first modification
we propose is to replace all likelihood measures
involved in Eq. (17) by general error measures E: Let
us define

$$ Q_i(x) := 1 - E(x) \quad \text{if } (i) \text{ wins.} \tag{19} $$

Then, in the case of the likelihood error E(x) =
1 − P(t | x, θ), we retrieve Q_i(x) = P(t | h_i = 1, x, θ).
Further, let

$$ V(x) := \sum_j Q_j(x) \, y_j(x) . \tag{20} $$
By these definitions we may rewrite Eq. (17) as

$$ \hat{y}_i(x) = \frac{Q_i(x) \, y_i(x)}{V(x)} = \frac{Q_i(x) \, y_i(x)}{\sum_j Q_j(x) \, y_j(x)} . \tag{21} $$
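
As a small worked example of the rewritten posterior, the following Python sketch computes the state value as the y-weighted sum of the action values and then rescales each y_i by the ratio of its action value to the state value. The function names and the numbers are hypothetical and serve only as an illustration.

```python
def state_value(q, y):
    """State value V = sum_j Q_j * y_j for one competitive group."""
    return sum(qj * yj for qj, yj in zip(q, y))

def desired_probabilities(q, y):
    """Desired probabilities Q_i * y_i / V for all neurons of the group."""
    v = state_value(q, y)
    return [qi * yi / v for qi, yi in zip(q, y)]

# Hypothetical values: neuron 0 performs much better than neuron 1,
# while both currently win with equal probability.
q = [0.9, 0.1]
y = [0.5, 0.5]
targets = desired_probabilities(q, y)
```

In this example the target for the first neuron rises from 0.5 to 0.9, i.e., probability mass is shifted toward the neuron whose wins lead to smaller errors.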



However, this equation needs some discussion with 
respect to its explicit calculation in our context — 
leading to the second modification. Calculating Qj (x) 
for every j amounts to evaluating the system for every 
possible competition outcome. One major difference 
to the algorithm presented in (Jordan & Jacobs 1994) 
is that we do not allow for such a separated evaluation 
of all experts in a single time step. In fact, this would 
be very expensive in case of hierarchically interacting
competitions and experts because the network would have to
be evaluated for each possible combinatorial state of
competition outcomes. Thus we propose to use an
approximation: We replace Q_j(x) by its average over
the recent history of cases where (j) won the
competition,

$$ \bar{Q}_j \leftarrow \gamma \, \bar{Q}_j + (1 - \gamma) \, Q_j(x) \quad \text{whenever } (j) \text{ wins,} \tag{22} $$

where γ ∈ [0, 1] is a trace constant (as a simplification
of the time dependent notation, we use the algorithmic
notation ← for a replacement if and only if (j)
wins). Hence, our adaptation rule finally reads



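$$ \delta_i = -\alpha_c \left[ \frac{\bar{Q}_i \, y_i}{\sum_j \bar{Q}_j \, y_j} - y_i \right] \quad \text{if } (i) \text{ wins,} \tag{23} $$

and δ_i = 0 if (i) does not win; which means a gradient
descent on the squared error between the approximated
desired probabilities ŷ_i and the distribution
y_i.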
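
The trace update and the resulting delta for the winner can be sketched as follows. This is a sketch under the assumption that the delta is the (scaled) difference between the current probability y_i and the approximated desired probability, and that only the winner is adapted; the function names and default parameter values are hypothetical.

```python
def update_trace(q_bar, winner, q_now, gamma=0.9):
    """Move the winner's average trace toward the newly observed value."""
    new = list(q_bar)
    new[winner] = gamma * q_bar[winner] + (1.0 - gamma) * q_now
    return new

def em_delta(q_bar, y, winner, alpha_c=1.0):
    """Delta values at the outputs of a competitive group (zero for losers)."""
    v = sum(q * p for q, p in zip(q_bar, y))   # approximated state value
    target = q_bar[winner] * y[winner] / v     # approximated desired probability
    delta = [0.0] * len(y)
    delta[winner] = -alpha_c * (target - y[winner])
    return delta

# Hypothetical step: neuron 0 wins and produces a large error (Q = 0).
q_bar = update_trace([1.0, 1.0], winner=0, q_now=0.0)
delta = em_delta(q_bar, [0.5, 0.5], winner=0)
```

Here the delta of the winner is positive, so a gradient descent step lowers its probability of winning on this stimulus.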

Q-learning. Probably, the reader has noticed that 
we chose the notations in the previous section in the 
style of reinforcement learning: If one interprets the 
winning of neuron (i) as a decision on an action, then 
Q_i(x) (called action-value function) describes the
(estimated) quality of taking this decision for
stimulus x; whereas V(x) (called state-value function)
describes the estimated quality for stimulus x without
having decided yet, see (Sutton & Barto 1998). In
this context, Eq. (21) is very interesting: it proposes
to adapt the probability y_i(x) according to the ratio
Q_i(x)/V(x), which gives the EM-algorithm a very
intuitive interpretation. To realize this equation without
the approximation described above one has to pro- 
vide an estimation of V(x), e.g., a neuron trained on 
this target value (a critic). We leave this for future 



- The adaptation rate is α = 0.01 for all algorithms (as
indicated in Eqs. (23,24,26), the delta-values for neurons in
competitive groups are multiplied by α_c).

- Parameters are initialized normally distributed around
zero with standard deviation σ = 0.01.

- The sigmoidal and linear activation functions are
φ_s(x) = 1/(1 + exp(−10x)) and φ_l(x) = x, respectively.

- The competition function ψ for softmax competition is
ψ_s(x) = e^{5x}.

- The Q-learning algorithm uses ε-greedy selection with ε =
0.1; the others select either the maximal activation or with
probability proportional to the activation.

- The values of the average traces Q̄_i and V̄ are initialized
to 1.

- The following parameters were used for the different
learning schemes:

               gradient       EM      Q        Oja-Q
  α_c                         1       10       100
  γ                           0.9              0.9
  ψ            ψ_s            φ_s              φ_l
  selection    proportional   max     greedy   max

Here, α_c is the learning rate factor, γ is the average trace
parameter, and ψ is the competition function.



Table 1: Implementation details 

research and instead directly address the Q-learning 
paradigm. 

For Q-learning, an explicit estimation of the
action-values Q_i(x) is modeled. In our case, we realize this
by considering Q_i(x) as the target value of the
excitations x_i, i ∈ {c}, i.e., we train the excitations of
competing neurons toward the action values,

$$ \delta_i^x = \alpha_c \begin{cases} x_i - Q_i & \text{if } (i) \text{ wins} \\ 0 & \text{else.} \end{cases} \tag{24} $$

This approach seems very promising; in particular,
it opens the door to temporal difference and TD(λ)
methods and other fundamental concepts of
reinforcement learning theory.
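
A minimal sketch of this rule in Python, assuming that the action value of the winner is obtained from the current error as 1 − E(x) and that non-winners receive a zero delta; the function name and its arguments are hypothetical.

```python
def q_learning_delta(x, winner, error, alpha_c=10.0):
    """Delta values at the excitations of a competitive group.

    Only the winner is trained: its excitation is pulled toward the
    action value Q = 1 - error observed for the current stimulus.
    """
    q = 1.0 - error
    delta = [0.0] * len(x)
    delta[winner] = alpha_c * (x[winner] - q)
    return delta

# Hypothetical example: neuron 1 wins with excitation 0.7 but Q is only 0.5.
delta = q_learning_delta([0.2, 0.7], winner=1, error=0.5, alpha_c=1.0)
```

The positive delta of the winner means that gradient descent lowers its excitation toward the observed action value.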

Oja-Q learning. Besides statistical and reinforce- 
ment learning theories, also the branch of unsuper- 
vised learning theories gives some inspiration for our 
problem. The idea of hierarchically, serially coupled 
competitive groups raises a conceptual problem: Can 
competitions in areas close to the input be trained 
without functioning higher level areas (closer to the 
output) and vice versa? Usually, back-propagation is 
the standard technique to address this problem. But 
this does not apply to either the EM-learning or the
reinforcement learning approaches because they
generate a direct feedback to competing neurons in any
layer. Unsupervised learning in lower areas seems 
to point a way out of this dilemma. As a first
approach we propose a mixture of unsupervised
learning in the fashion of the normalized Hebb rule and
Q-learning. The normalized Hebb rule (of which the
Oja rule is a linearized version) can be realized by
setting δ_i^x = −α_c z_i for a neuron (i) in a
competitive group (recall z_i ∈ {0, 1}). The gradient descent
with respect to adjacent input links gives the ordinary
Δw_ij ∝ z_i z_j rule. Thereafter, the input weights
(including the bias) of each neuron (i), i ∈ {c}, are
normalized. We modify this rule in two respects. First,
we introduce a factor (Q_i − V̄) that accounts for the
success of neuron (i) being the winner. Here, V̄ is an
average trace of the feedback:

$$ \bar{V} \leftarrow \gamma \, \bar{V} + (1 - \gamma) \, Q_i(x) \quad \text{every time step,} \tag{25} $$

where (i) is the winner. Second, in the case of
failure, Q_i < V̄, we also adapt the non-winners in order
to increase their response on the stimulus next time.
Thus, our rule reads

$$ \delta_i^x = -\alpha_c \, (Q_i - \bar{V}) \begin{cases} z_i & \text{if } Q_i > \bar{V} \\ z_i - 0.5 & \text{else.} \end{cases} \tag{26} $$
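
The two cases of this rule can be sketched in Python as follows. The sketch assumes that Q denotes the feedback 1 − E(x) obtained with the current winner and that the delta values are applied at the excitations; the function name and arguments are hypothetical, and the subsequent weight normalization is omitted.

```python
def oja_q_delta(z, q, v_bar, alpha_c=100.0):
    """Delta values for a competitive group with binary states z.

    On success (q > v_bar) only the winner (z_i = 1) is reinforced.
    On failure the winner is weakened and the non-winners are
    strengthened through the shifted term z_i - 0.5.
    """
    factor = -alpha_c * (q - v_bar)
    if q > v_bar:
        return [factor * zi for zi in z]
    return [factor * (zi - 0.5) for zi in z]

# Hypothetical examples with neuron 0 as the winner.
success = oja_q_delta([1, 0], q=0.9, v_bar=0.5, alpha_c=1.0)
failure = oja_q_delta([1, 0], q=0.1, v_bar=0.5, alpha_c=1.0)
```

In the failure case the non-winner receives a negative delta, so a descent step raises its excitation for this stimulus, which matches the stated intention of giving it a better chance next time.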



Similar modifications are often proposed in rein- 
forcement learning models (Barto & Anandan 1985; 
Barto & Jordan 1987). The rule investigated here is 
only a first proposal; all rules presented in the ex- 
cellent survey of Diamantaras & Kung (1996) can 
equally be applied and are of equal interest but have 
not yet been implemented by the author. 

IV Empirical study 

We test the functionality of our model and the learn- 
ing rules by addressing a variant of the test presented 
in (Jacobs, Jordan, & Barto 1990). A single bit of an 
8-bit input decides on the subtask that the system has 
to solve on the current input. The subtasks itself are 
rather simple and in our case (unlike in (Jacobs, Jor- 
dan, & Barto 1990)) are to map the 8-bit input either 
identically or inverted on the 8-bit output. The task 
has to be learned online. We investigate the learning 
dynamics of a conventional feed-forward neural net- 
work (FFNN) and of our model with the 4 different 
learning methods. We use a fixed architecture similar 
to an 8-10-8-layered network with 10 hidden neurons 
but additionally install 2 competitive neurons that 
receive the input and each gates half of the hidden 
neurons, see Figure 3. In the case of the conventional 
FFNN we used the same architecture but replaced all 




Figure 3: The architecture we use for our
experiments. All output neurons have linear activation
functions φ(x) = x. All except the input neurons
have bias terms.



gating and competitive connections by conventional 
links. 
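
For concreteness, training pairs for this task could be generated as in the following sketch. Which input bit selects the subtask is not stated above, so using the first bit (`task_bit=0`), and letting the value 0 select the identity mapping, are assumptions; the function name is hypothetical as well.

```python
import random

def make_sample(rng, n_bits=8, task_bit=0):
    """Draw one (stimulus, target) pair of the identity/inversion task."""
    x = [rng.randint(0, 1) for _ in range(n_bits)]
    if x[task_bit] == 0:
        t = list(x)              # "identical" subtask: copy the input
    else:
        t = [1 - b for b in x]   # "not" subtask: invert every bit
    return x, t

# Online learning would draw a fresh sample at every time step.
rng = random.Random(1)
samples = [make_sample(rng) for _ in range(50)]
```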

Figure 4 displays the learning curves averaged over 
20 runs with different weight initializations. For im- 
plementation details see Table 1. First of all, we find 
that all of the 4 learning methods perform well on this 
task compared to the conventional FFNN. The curves 
can best be interpreted by investigating if a task sep- 
aration has been learned. Figure 5 displays the fre- 
quencies of winning of the two competitive neurons 
in case of the different subtasks. The task separation 
would be perfect if these two neurons reliably
distinguished the two subtasks. It is first of all noticeable that
all 4 learning methods learn the task separation. In
the case of Q-learning the task separation is found
rather late and remains noisy because of the ε-greedy
selection used. This explains its slower learning curve
in Figure 4. EM and Oja-Q realize strict task
separations (maximum selection); for the gradient method
it is still a little noisy (softmax selection). It is clear
that, if the task separation has been found and fixed, 
all four learning methods proceed equivalently. So 
it is no surprise that the learning curves in Figure 
4 are very similar except for a temporal offset cor- 
responding to the time until the task separation has 
been found, and the non-zero asymptotic error corre- 
sponding to the noise of task separation. (Note that 
Figure 5 represents only a single, typical trial.) 

Generally, our experience was that the learning 
curves may look very different depending on the weight 
initialization. It also happened that the task
separation was not found when weights and biases
(especially of the competing neurons) were initialized
with very large values (by N(0,0.5)). One of the competitive
neurons then dominates from the very beginning and prevents
the "other expert" from adapting in any way. Definitely,
a special, perhaps equal initialization of competitive
neurons could be profitable.

Figure 4: Learning curves for the conventional neural
network and the four different learning schemes.

Finally, the conventional FFNN also solves the task
completely only sometimes, more often when
weights are initialized with relatively large values. This
explains the rather high error offset for its learning curve.

V Conclusion 

We generalized conventional neural networks to al- 
low for multi-expert like interactions. We introduced 
4 different learning methods for this model and gave 
empirical support for their functionality. What makes 
the model particularly interesting is: 

1. The generality of our representation of system
architecture allows new approaches for the struc- 
ture optimization of multi-expert systems, in- 
cluding arbitrary serial, parallel, and hierarchi- 
cal architectures. In particular evolutionary tech- 
niques of structure optimization become appli- 
cable. 

2. The model allows the combination of various 
learning methods within a single framework. 
Especially the idea of integrating unsupervised
learning methods in a system that adapts in a
supervised manner opens new perspectives. Many more
techniques from elaborated learning theories can
be transferred to our model. In principle, the
uniformity of architecture representation would
allow one to specify freely where learning takes
place and by which principles.

3. The model overcomes the limitation of
conventional neural networks in performing task
decomposition, i.e., to adapt in a decorrelated way to
decorrelated data (Toussaint 2002). 
Figure 5: The gating ratios for single trials for the
four different learning schemes: The four rows refer
to gradient, EM-, Q-, and Oja-Q-learning; and the
two columns (1st subtask, 2nd subtask) refer to the two
classes of stimuli, one for the "identical" task and one
for the "not" task. Each graph displays two curves that
sum to 1 and indicate how often the first or second gating
neuron wins in case of the respective subtask.